

Section: New Results

Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Participants: Mark Schmidt [correspondent], Nicolas Le Roux [correspondent].

In [33] we consider optimizing a smooth convex function f that is the average of a set of differentiable functions fi, under the assumption, considered by [87] and [90], that the norm of each gradient fi' is bounded by a linear function of the norm of the average gradient f'. We show that under these assumptions the basic stochastic gradient method with a sufficiently small constant step size has an O(1/k) convergence rate, and a linear convergence rate if f is strongly convex.

We write our problem as

\min_x f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (2)

where we assume that f is convex and its gradient f' is Lipschitz-continuous with constant L, meaning that for all x and y we have

\| f'(x) - f'(y) \| \le L \| x - y \|.

If f is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian f''(x) are bounded between 0 and L for all x.

Deterministic gradient methods for problems of this form use the iteration

x_{k+1} = x_k - \alpha_k f'(x_k), \qquad (3)

for a sequence of step sizes αk. In contrast, stochastic gradient methods use the iteration

x_{k+1} = x_k - \alpha_k f_i'(x_k), \qquad (4)

for an individual data sample i selected uniformly at random from the set {1, 2, ..., N}.
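The two iterations can be sketched side by side. The following is an illustrative example, not code from the paper: a one-dimensional least-squares problem where each f_i(x) = (1/2)(a_i x - b_i)^2, with the data a_i, b_i and the step-size choices being assumptions made here for the sketch.

```python
import random

# Toy finite-sum problem (assumed example, not from the paper):
# f(x) = (1/N) * sum_i f_i(x) with f_i(x) = 0.5 * (a_i * x - b_i)**2,
# so f_i'(x) = a_i * (a_i * x - b_i).
random.seed(0)
N = 50
a = [random.uniform(0.5, 2.0) for _ in range(N)]
b = [random.uniform(-1.0, 1.0) for _ in range(N)]

def grad_i(x, i):
    """Gradient of a single component f_i at x."""
    return a[i] * (a[i] * x - b[i])

def full_grad(x):
    """Gradient of the average f; costs O(N) per evaluation."""
    return sum(grad_i(x, i) for i in range(N)) / N

def gradient_descent(x0, alpha, iters):
    """Deterministic iteration (3): x_{k+1} = x_k - alpha * f'(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - alpha * full_grad(x)
    return x

def stochastic_gradient(x0, alpha0, iters):
    """Stochastic iteration (4) with decreasing steps alpha_k = alpha0/(k+1);
    each step uses one random component, so its cost is independent of N."""
    x = x0
    for k in range(iters):
        i = random.randrange(N)
        x = x - (alpha0 / (k + 1)) * grad_i(x, i)
    return x

x_gd = gradient_descent(0.0, alpha=0.2, iters=200)
x_sg = stochastic_gradient(0.0, alpha0=0.5, iters=5000)
```

The contrast in per-iteration cost is visible in the two update functions: `gradient_descent` touches all N components each step, while `stochastic_gradient` touches one.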

The stochastic gradient method is appealing because the cost of its iterations is independent of N. However, to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes {αk}, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration k of O(1/\sqrt{k}) [78], meaning that

\mathbb{E}[f(x_k)] - f(x^*) = O(1/\sqrt{k}).

In contrast, the deterministic gradient method with a constant step size has a smaller error of O(1/k)  [79] . The situation is more dramatic when f is strongly convex, meaning that

f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{\mu}{2} \| y - x \|^2, \qquad (5)

for all x and y and some μ>0. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by μ. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of O(1/k) [77], while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies

f(x_k) - f(x^*) \le \rho^k \, [ f(x_0) - f(x^*) ],

for some ρ<1  [71] .
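The linear rate can be checked numerically. A minimal sketch, on an assumed strongly convex quadratic chosen here so that ρ can be computed in closed form (this example is not from the paper):

```python
# Gradient descent on f(x) = 0.5 * (x - 3)**2, which is strongly convex
# with mu = L = 1; f(x*) = 0 at x* = 3. Constant step alpha = 0.5.
def f(x):
    return 0.5 * (x - 3.0) ** 2

def fprime(x):
    return x - 3.0

alpha = 0.5
x = 0.0
gaps = []
for _ in range(20):
    gaps.append(f(x))          # f(x_k) - f(x*), since f(x*) = 0
    x = x - alpha * fprime(x)

# For this quadratic each step scales (x_k - x*) by (1 - alpha), so the
# function-value gap contracts by rho = (1 - alpha)**2 = 0.25 every step.
ratios = [gaps[k + 1] / gaps[k] for k in range(10)]
```

The successive ratios are constant at ρ = 0.25, which is exactly the geometric decay f(x_k) - f(x^*) ≤ ρ^k [f(x_0) - f(x^*)] stated above.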

We show that if the individual gradients fi'(xk) satisfy a certain strong growth condition relative to the full gradient f'(xk), the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
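One regime where such a growth condition plausibly holds is when every component f_i shares the same minimizer, so that all individual gradients vanish together with the full gradient. The sketch below constructs such a case (the specific functions and constants are assumptions made for illustration, not taken from the paper): f_i(x) = (1/2) a_i (x - c)^2, so f_i'(x) = a_i (x - c) and f'(x) = ā (x - c) with ā the mean of the a_i, giving ||f_i'(x)|| ≤ (max_i a_i / ā) ||f'(x)||. Running the stochastic gradient method with a constant step size then exhibits linear convergence:

```python
import random

# Components f_i(x) = 0.5 * a_i * (x - c)**2 all minimized at the same
# point x* = c, so each f_i'(x) = a_i * (x - c) is a scalar multiple of
# the full gradient and a strong-growth-type bound holds (assumed example).
random.seed(1)
N, c = 20, 2.0
a = [random.uniform(0.5, 2.0) for _ in range(N)]

x = 0.0
alpha = 0.4                          # constant step, with alpha < 2 / max(a)
errs = [abs(x - c)]
for _ in range(100):
    i = random.randrange(N)
    x = x - alpha * a[i] * (x - c)   # stochastic step (4) with f_i'(x)
    errs.append(abs(x - c))
```

Here every step multiplies the error |x_k - c| by |1 - α a_i| ≤ 0.8, so the iterates contract geometrically despite the step size never decreasing, illustrating the constant-step linear rate the result establishes.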